The Python program first imports the required modules and packages, then specifies the target URL:
from selenium import webdriver
import time

URL = "https://munchery.com/"
Next, create the Python spider class DishesSpider:
class DishesSpider():
    def __init__(self, url):
        self.url_to_crawl = url
        self.all_items = [["Name", "Image", "URL"]]

    def start_driver(self):
        print("Starting WebDriver...")
        self.driver = webdriver.Chrome("./chromedriver")
        self.driver.implicitly_wait(10)
The constructor above initializes the target URL url_to_crawl and the all_items list that will hold the scraped dish items, and the start_driver() function starts the WebDriver.
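As a minimal illustration of the all_items structure (the dish values below are made up): the list starts with a header row, and each scraped dish is later appended as a [title, image, url] row.

```python
# all_items starts with a header row; each dish found by the spider
# is appended as a [title, image, url] row (values here are made up)
all_items = [["Name", "Image", "URL"]]
all_items.append(["Roast Chicken", "chicken.jpg", "https://munchery.com/menu/1"])

print(len(all_items))   # 2
print(all_items[1][0])  # Roast Chicken
```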
Next are the close_driver() function, which shuts down the WebDriver, and the get_page() function, which loads a web page:
    def close_driver(self):
        self.driver.quit()
        print("Closing WebDriver...")

    def get_page(self, url):
        print("Loading the page...")
        self.driver.get(url)
        time.sleep(2)
    def login(self):
        print("Logging into the site...")
        try:
            form = self.driver.find_element_by_xpath('//*[...]')
            email = form.find_element_by_xpath('.//*[...]')
            email.send_keys('hueyan@ms2.hinet.net')
            zipcode = form.find_element_by_xpath('.//*[...]')
            zipcode.send_keys('12345')
            button = form.find_element_by_xpath('.//button[...]')
            button.click()
            print("Login succeeded...")
            time.sleep(5)
            return True
        except Exception:
            print("Login failed...")
            return False
The login() function above logs into the site: after locating the form element, it fills in the email and ZIP code fields, then clicks the button to submit the data and display the dishes.
The grab_dishes() function scrapes the dish items: after locating the element of each dish, it uses a for/in loop to call the process_item() function to extract each dish's information.
    def grab_dishes(self):
        print("Scraping dish items...")
        for div in self.driver.find_elements_by_xpath('//a[...]'):
            item = self.process_item(div)
            if item:
                self.all_items.append(item)
    def process_item(self, div):
        item = []
        try:
            url = div.get_attribute("href")
            image = div.find_element_by_xpath(
                './/img[...]').get_attribute("src")
            title = div.find_element_by_xpath('.//div[...]').text
            item = [title, image, url]
            return item
        except Exception:
            return False
The process_item() function above uses try/except to retrieve, in order, the dish's URL, image, and name; if any lookup fails, it returns False so the item is skipped.
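The try/except behavior of process_item() can be exercised without a browser by using a stand-in element object. FakeElement below is purely hypothetical, mimicking only the parts of a Selenium WebElement that process_item() touches, and the XPath strings are placeholders:

```python
class FakeElement:
    """A hypothetical stand-in for a Selenium WebElement (illustration only)."""
    def __init__(self, attrs=None, text="", children=None):
        self._attrs = attrs or {}
        self.text = text
        self._children = children or {}

    def get_attribute(self, name):
        return self._attrs.get(name)

    def find_element_by_xpath(self, xpath):
        if xpath not in self._children:
            raise Exception("no such element")
        return self._children[xpath]

def process_item(div):
    # Same logic as DishesSpider.process_item(), minus self
    try:
        url = div.get_attribute("href")
        image = div.find_element_by_xpath('.//img').get_attribute("src")
        title = div.find_element_by_xpath('.//div').text
        return [title, image, url]
    except Exception:
        return False

good = FakeElement(
    attrs={"href": "https://munchery.com/menu/1"},
    children={
        './/img': FakeElement(attrs={"src": "dish.jpg"}),
        './/div': FakeElement(text="Roast Chicken"),
    },
)
broken = FakeElement(attrs={"href": "https://munchery.com/menu/2"})  # no children

print(process_item(good))    # ['Roast Chicken', 'dish.jpg', 'https://munchery.com/menu/1']
print(process_item(broken))  # False
```

A dish element missing its image or title raises inside the try block, so the function returns False and grab_dishes() simply skips that item.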
Finally, the parse_dishes() function calls the functions above to scrape the dish items:
    def parse_dishes(self):
        self.start_driver()                 # start the WebDriver
        self.get_page(self.url_to_crawl)
        if self.login():                    # did the login succeed?
            self.grab_dishes()              # scrape the dishes
        self.close_driver()                 # close the WebDriver
        if self.all_items:
            return self.all_items
        else:
            return []
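The list returned by parse_dishes() can then be saved with Python's csv module. This is only a sketch; the file name dishes.csv and the sample row values are assumptions:

```python
import csv

# Sample rows in the same shape as all_items (values are made up)
rows = [
    ["Name", "Image", "URL"],
    ["Roast Chicken", "chicken.jpg", "https://munchery.com/menu/1"],
]

# Write the rows to a CSV file, then read them back to confirm
with open("dishes.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

with open("dishes.csv", newline="", encoding="utf-8") as f:
    saved = list(csv.reader(f))

print(saved[0])  # ['Name', 'Image', 'URL']
```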